ABSTRACT


In the past, analytical and measurement-based models have been developed to char- 
acterize computer system behavior. An open issue has been how these models can be 
used, if at all, for system design improvement. This thesis attempts to address this issue. 
It proposes a combined statistical/analytical approach to use measurements from one 
environment to model the system failure behavior in a new environment. A comparison 
of the predicted results with the actual data from the new environment shows a close 
correspondence. 
1. INTRODUCTION


1.1 Motivation 

The evaluation of computer system performability is an important research issue. 
The need for both higher performance and dependability motivates the development of 
accurate and powerful models to aid in system design, tuning, and reliability evaluation. 
This thesis is concerned with developing measurement-based performability models and 
the use of such models for predicting system behavior under new and yet unmeasured 
conditions. 

Previous research [1], [2] has shown that system failure rate is dependent on resource 
usage and that increased resource usage is accompanied by increased failure. Physically, 
this dependence can be caused by several factors: Increased usage can result in higher 
probability of detecting faults as more of the system is exercised. Increased usage can 
also result in more stress on the hardware, in the form of higher levels of electronic noise, 
temperature, and device and mechanical stress. Finally, increased usage can result in
increased likelihood of human error, which can lead to system failure.

The dependence of reliability on resource usage suggests that error/failure behavior 
may be predicted. By observing the resource usage behavior and failure occurrences 
of a system over some period of time, the interaction between workload and reliability 
can be characterized. Using this characterization, the failure behavior of a new usage 
environment may be predicted, given knowledge of the workload characteristics of that 
environment. This thesis characterizes the relationship between usage and failure by 
building measurement-based state-transition models, and presents a method for using 
these models to evaluate the reliability of the system under specified usage characteristics. 

1.2 Related Research 

Past studies have focused heavily on developing analytical models for system failure 
[3], [4], [5], [6]. The approach has been to assume a distribution for the time to failure of 
various system components. Whether the modeling technique is combinatorial or Markov, 
exponential distributions have been typically assumed because of the tractability of the 
resulting models. However, this assumption is in general not supported by real data. 

On the other hand, field measurements have been used in several studies. The con- 
sensus is that system failure is resource-usage dependent. Early studies based on real 
measurements [1], [2], [7], [8] show that the operational environment of a system is an
important factor in predicting its reliability. Increased system failure rates due to
increased utilization have been documented and modeled [2], [9]. Results in [10] indicate
that CPU-related failures increase exponentially with resource usage after the system uti- 
lization reaches a saturation point. A fault prediction method based on failure patterns 
in error log files has been explored in [11]. Performability models which combine analyt- 
ical modeling and measurements have been developed [12]. The study in [13] describes 
a diagnostic methodology for detecting anomalous behaviors of a network environment. 
A failure prediction method based on intermittent error characteristics has been inves- 
tigated in [14]. Studies based on real data not only provide accurate quantification of 
system dependability but also reflect dynamic changes in system behavior. Several ana- 
lytical models that take into account resource-usage effects have recently been proposed 
[15], [16].

This thesis constructs performability models based on real data as in [12] but goes 
further by developing a methodology to predict reliability using the performability mod- 
els. 

1.3 Overview 

Based on the assumption that system failure is resource-usage dependent and given 
that resource-usage information is available, failure characteristics should be predictable. 
This thesis investigates this predictability. Based on measurements of the
resource-usage/error/failure behavior of a system, we attempt to predict the failure behavior
of a new environment for which only the resource-usage information is known.

In particular, resource usage and error-related activities are first recorded over a
period of time. Using the recorded data, a state-transition model is constructed to
represent resource-usage behavior. This model is then extended to include the recorded
error/failure behavior. The relationship between the usage-only model and the full
usage/error model is characterized empirically by regressing the parameters which describe
error/failure rates on the resource-usage indices (e.g., CPU utilization). Reliability under
a new usage environment is then predicted by using the regression relations to estimate
the error/failure rates corresponding to the new usage environment. In effect, the
usage model of the new environment is extended to the full resource-usage/error/recovery
model through regression-based estimation.


The remainder of the thesis is organized as follows. Chapter 2 discusses the modeling 
and prediction method in detail. Chapter 3 validates the method by applying it to 
an independent data set. Chapter 4 summarizes the method, offers some concluding 
remarks, and explores possibilities for future work. 


2. MODELING AND PREDICTION METHODOLOGY 


The prediction method is developed on data collected from an IBM 3081 system, but 
it does not use any information that is unique to the particular IBM system. Therefore, 
the approach is system-independent, and in principle can be applied to other systems as 
well. In essence, the method consists of four steps: 


1. Collect resource-usage and error/failure/recovery data on a system for different
periods of time.

2. Using the measured data, construct the system resource-usage model and its
corresponding resource-usage/error/recovery model for each period of time.

3. Use the models identified above to derive, via nonlinear regression, empirical
relationships between resource-usage indices (e.g., CPU utilization) and model
parameters, as well as other necessary auxiliary relationships.

4. Use the derived regression models to extrapolate the system failure behavior
of a new environment by predicting the resource-usage/error/recovery model that
corresponds to the resource-usage model of the new environment.

2.1 Measured Data 

The data used to develop the proposed methodology are collected from an IBM 3081 
dual processor and channel system running under the MVS operating system over a period
of about three months. The resource usage data are recorded by the IBM MVS/370 
system Resource Management Facility. The sampling time is 0.5 seconds. Every hour, 
the average values are computed for each index and stored to depict resource usage of the 
particular hour. The resource-usage indices that describe the state of the system include
CPU utilization (the percent of time that the CPU is executing instructions), channel 
busy (the fraction of time that the channel is busy and the CPU is waiting, which reflects 
the memory contention), I/O usage (the number of successful Start I/O and Resume I/O 
instructions issued to the channel), and disk usage (the number of requests serviced on 
the direct access storage devices). 

The error and recovery data are logged by the operating system. At every occurrence
of an error, the operating system records the time, description, system status, and
recovery attempts associated with the error. Because of the manner in which errors are
detected and reported in a system, a single fault may manifest itself as more than one
error, depending on the activity at the time of the error. To address this problem, errors
occurring in close succession (within five minutes) are merged together [12]. The resulting
error data are classified as CPU-related errors, channel errors, software errors, direct
access storage device (DASD) errors, and multiple errors. Multiple errors are identified for
instances in which different types of error occur in close succession due to a common cause.

The data are divided into four temporal intervals. Models are constructed for each
interval. Models from the first three intervals are used in the regression. The last interval
is reserved for verification of prediction results. The data set, divided into intervals, is
shown in Table 2.1.

Table 2.1: IBM3081 Data Intervals

interval   span                 no. of obs.   no. of err.
1          1/7/85 - 1/18/85                   151
2          1/28/85 - 2/4/85     100           187
3          2/4/85 - 2/22/85                   204
4          2/22/85 - 3/15/85    100           184
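The five-minute merging rule described above can be sketched as follows. This is only an illustrative sketch: the function name `coalesce`, the tuple layout, and the label "MULTIPLE" are hypothetical, the window is measured from the previous error in a group (an assumption), and no attempt is made to verify that grouped errors share a common cause.

```python
def coalesce(events, window=300):
    """Merge errors that occur within `window` seconds of the previous error.

    events: list of (timestamp_seconds, error_type), sorted by time.
    Returns a list of (start_time, error_type); groups that contain more
    than one error type are labeled "MULTIPLE".
    """
    groups = []
    for t, kind in events:
        if groups and t - groups[-1]["last"] <= window:
            # Close enough to the previous error: extend the current group.
            groups[-1]["last"] = t
            groups[-1]["kinds"].add(kind)
        else:
            # Otherwise start a new group with this error.
            groups.append({"start": t, "last": t, "kinds": {kind}})
    return [
        (g["start"], next(iter(g["kinds"])) if len(g["kinds"]) == 1 else "MULTIPLE")
        for g in groups
    ]


# Hypothetical log: three errors in quick succession, then an isolated one.
log = [(0, "CPU"), (100, "CPU"), (350, "DASD"), (1000, "SOFTWARE")]
print(coalesce(log))
```

Measuring the window from the most recent error lets a long burst collapse into a single event; measuring it from the first error of the group would be an equally plausible reading of the rule.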


2.2 Workload/Failure Model 

Having divided the available data, the next step is to use the data to identify state
transition models for each interval to quantify system characteristics. This is similar
to the analysis performed in [12]. First, the normal resource-usage model is identified.
The set of resource-usage data depicts the state of the system at each time interval
in n-dimensional space; each dimension describes one aspect of the system status (CPU
utilization can be one of the dimensions). The set of data may potentially have an infinite
number of n-dimensional vectors. To make the problem manageable, cluster analysis is
used to summarize the data. The n-dimensional vectors are partitioned into a small
number of groups or clusters suggested by the data. The vectors in the same cluster tend
to be similar in various dimensions; those in different clusters tend to be dissimilar.

A k-means clustering algorithm is used in this study. The k-means algorithm min- 
imizes the sum of the squares of the Euclidean distances between the members of each 
cluster and the centroid in the dimensions of the clustering variable and, at the same 
time, maximizes the intercluster centroid distances. In essence, observations that are 
spatially close in the dimension(s) of the clustering variable(s) are grouped into the same 
clusters and represented by the centroids of the clusters which are determined by the 
means of the members of the cluster. 
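As an illustration of the clustering step, the sketch below implements the basic Lloyd iteration for k-means on (CPU utilization, channel busy) pairs. It is a minimal stand-in, not necessarily the exact procedure used in this study; the function name `kmeans`, the random initialization, and the sample observations are assumptions made for the example.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's k-means on a list of equal-length tuples.

    Returns (centroids, labels), where labels[n] is the cluster of points[n].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random observations
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each observation to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


# Hypothetical hourly averages: (CPU utilization, channel busy).
hours = [(0.22, 0.13), (0.25, 0.14), (0.66, 0.10), (0.68, 0.09), (0.95, 0.11), (0.97, 0.12)]
centroids, labels = kmeans(hours, k=3)
print(centroids)
print(labels)
```

Each returned centroid plays the role of one system state in the state transition model constructed later.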

In this study, the effect of CPU utilization on failure is of interest. The two
dimensions used for cluster analysis are CPU utilization and channel busy. Hence, those
observations in the same cluster have similar CPU utilization and channel busy
characterizations. Channel busy is used in addition to CPU utilization because channel busy is
inversely related to CPU utilization and has the effect of improving the clustered
partitions. Table 2.2 shows the result of the k-means algorithm applied to the first interval.
The data are grouped into four clusters; the numbers of observations that belong to each
cluster are 9, 5, 25, and 61. The centroids are defined by the means of the clustering
variables (CPU utilization and channel busy) over the observations in the cluster. For
example, Cluster 1 is defined by a CPU utilization of 0.233 and channel busy of 0.133,
the mean CPU utilization and channel busy over the 9 observations in that cluster.

Table 2.2: An Example of CPU Bound Clustering

Cluster   No. of observations   CPU utilization   Channel busy
1         9                     0.233             0.133
2         5                     0.244             0.371
3         25                    0.667             0.099
4         61                    0.961             0.114

After the resource usage data are clustered, each cluster is used to depict a system 
state, and a state transition model is constructed. Each state is defined by the centroid 
of the cluster in terms of the clustering variables. A state transition model is defined by 
interstate transition probability and holding time distribution. The interstate transition 
probability $p_{ij}$ between any two states is defined to be

$$p_{ij} = \frac{\text{observed number of transitions from state } i \text{ to state } j}{\text{observed number of transitions from state } i}$$

Observe that consecutive observations belonging to the same state are not defined as 
transitions. In addition, a “nonmeasured” state is defined to represent time intervals 
for which measurements have not been recorded. Figure 2.1 is an example of a state 
transition diagram. The arrows originating from states 3 and 4, but not terminating on 
any of the other states, indicate transitions to the “nonmeasured” state. Similarly, the 
fact that state 2 is not entered from any of the other states shown, indicates that it is 
entered from the “nonmeasured” state. The holding time between any two states, i.e.,
the time the process remains in a state before it makes a transition to another state, is
defined by the distribution $h_{ij}(t)$. Thus, the state transition model of resource usage is
generated. The model can give insights into other important parameters of the system.

Figure 2.1: Resource-Usage State Transition Diagram Corresponding to Table 2.2
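The estimation of transition probabilities and mean holding times from a sequence of clustered observations can be sketched as below. The function name `estimate_semi_markov` and the sample sequence are hypothetical. The sketch assumes one state label per equally spaced observation (for example, one per hour) and treats the first run as if it began at the first observation, which is an approximation. A "nonmeasured" state can be represented simply as one more label.

```python
from collections import defaultdict


def estimate_semi_markov(states, step=1.0):
    """Return (p, tau) keyed by (i, j).

    states: state label of each consecutive observation.
    step:   time covered by one observation (e.g. one hour).
    p[(i, j)]   = transitions i -> j / all transitions out of i.
    tau[(i, j)] = mean time spent in i before moving to j.
    """
    counts = defaultdict(int)   # (i, j) -> number of observed i -> j transitions
    held = defaultdict(float)   # (i, j) -> total time held in i before moving to j
    run_start = 0
    for t in range(1, len(states)):
        if states[t] != states[t - 1]:   # repeats of a state are not transitions
            i, j = states[t - 1], states[t]
            counts[(i, j)] += 1
            held[(i, j)] += (t - run_start) * step
            run_start = t
    out_of = defaultdict(int)
    for (i, _), n in counts.items():
        out_of[i] += n
    p = {ij: n / out_of[ij[0]] for ij, n in counts.items()}
    tau = {ij: held[ij] / n for ij, n in counts.items()}
    return p, tau


# Hypothetical hourly state sequence.
p, tau = estimate_semi_markov([1, 1, 2, 1, 1, 1, 3, 1, 2])
print(p)
print(tau)
```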

In a similar fashion, the measured error events are classified into five different cat- 
egories: CPU error, software error, channel error, DASD error, and multiple error, as 
suggested by the data. The recovery procedures are also divided into categories based 
on recovery cost which is measured by the system overhead required to handle an error. 
At the lowest cost level is the hardware recovery which uses an error correction code or
hardware instruction retry. At the second level is the software recovery which involves 
software-initiated recovery. At the highest level, no recovery is possible; the system has 
to be brought down for repair. 

As a final step in model construction, the normal resource-usage, error, and recovery
models are combined into one unified model. Figure 2.2 shows an example of the
resource-usage/error/recovery model that corresponds to the resource-usage model in Figure 2.1.
Table 2.3: Transition Probabilities from Resource-Usage States to Error States for Figure 2.2

State   CPU utilization   CPU   Channel   DASD    Software   Multiple
1       0.233             0               0.140   0.290      0.210
2       0.245             0     0         0.500   0.167      0.083
3       0.667             0     0.046     0.538   0.123      0.046
4       0.961             0     0.011     0.543   0.202      0.128


To preserve the clarity of the figure, the transition probabilities from normal workload 
states to error states are not shown, but instead are listed in Table 2.3. 

It is very important to note that a resource-usage model contains only normal
resource-usage states, whereas a resource-usage/error/recovery model contains error and/or recovery
states in addition to the normal resource-usage states.

2.3 Parameters of Semi-Markov Models 

The state transition models are semi-Markov and are defined by the transition probability
and holding time distribution associated with each transition. In this study, the
mean holding time is of interest. When $p_{ij}$ and $\tau_{ij}$ are defined for every transition, a
semi-Markov model is identified. Formally, $p_{ij}$ is the probability that a semi-Markov
process that entered state $i$ on its last transition will enter state $j$ on its next transition,
and $\tau_{ij} = \int_0^\infty t\,h_{ij}(t)\,dt$ is the mean holding time for the transition $i \to j$, the time the
process will spend in state $i$ before making a transition to state $j$.

Next, there is a series of other interesting parameters that can be defined [17]. Their
relevance will become obvious in the next section. The mean waiting time, $\tau_i$, for state
$i$, defined (summing over the $N$ states of the model) as

$$\tau_i = \sum_{j=1}^{N} p_{ij}\,\tau_{ij} \qquad (2.1)$$

is the time that the process will spend in state $i$ before making a transition. In essence,
a waiting time is merely a holding time that is unconditioned on the destination state.
The limiting destination probability, $\gamma_{ij}$, defined as

$$\gamma_{ij} = e_i\,p_{ij}\,\tau_{ij} \qquad (2.2)$$

is the probability that at a time instant in the steady state, the process is in state $i$ and
planning to make its next transition to state $j$.

The limiting entrance probability, $e_i$, defined as

$$e_i = \frac{\pi_i}{\tau} \qquad (2.3)$$

is the probability of the process entering state $i$ at any time in the steady state,
independent of the starting state.

The mean time between transitions, $\tau$, defined as

$$\tau = \sum_{i=1}^{N} \pi_i\,\tau_i \qquad (2.4)$$

is, in essence, a waiting time that is unconditional on the starting state.

The limiting state probability, $\pi_i$, for the imbedded Markov process described by the
transition probability matrix $P$, is defined by

$$\pi = \pi P \qquad (2.5)$$

or

$$\pi_j = \sum_{i=1}^{N} \pi_i\,p_{ij} \qquad (2.6)$$

Equation (2.6) along with

$$\sum_{j=1}^{N} \pi_j = 1 \qquad (2.7)$$

gives the unique nonnegative solution for $\pi$.

In this study, only the transitions with i = normal resource-usage state and j = error
state are of interest. Unless otherwise specified, for every model parameter, i implies
normal resource-usage states and j implies error states.
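The parameters of this section can be computed from a transition probability matrix and a matrix of mean holding times, as sketched below. The function names are hypothetical. The entrance probability is taken as the limiting state probability divided by the mean time between transitions, which is the standard semi-Markov relation and should be read as an assumption of this sketch. The limiting state probabilities are obtained by a damped power iteration, chosen because imbedded chains without self-transitions can be periodic.

```python
def stationary(P, iterations=10000, tol=1e-12):
    """Solve pi = pi P with sum(pi) = 1 (Equations (2.5) to (2.7)).

    Uses the averaged update pi <- (pi + pi P) / 2, which has the same
    fixed point as pi = pi P but does not oscillate on periodic chains.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        step = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        nxt = [0.5 * (a + b) for a, b in zip(pi, step)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


def semi_markov_parameters(P, T):
    """P[i][j]: transition probabilities; T[i][j]: mean holding times."""
    n = len(P)
    pi = stationary(P)
    # Mean waiting time of each state, Equation (2.1).
    tau_i = [sum(P[i][j] * T[i][j] for j in range(n)) for i in range(n)]
    # Mean time between transitions, Equation (2.4).
    tau = sum(pi[i] * tau_i[i] for i in range(n))
    # Limiting entrance probability (assumed relation e_i = pi_i / tau).
    e = [pi[i] / tau for i in range(n)]
    # Limiting destination probability, Equation (2.2).
    gamma = [[e[i] * P[i][j] * T[i][j] for j in range(n)] for i in range(n)]
    return pi, tau_i, tau, e, gamma


# Hypothetical two-state example.
pi, tau_i, tau, e, gamma = semi_markov_parameters([[0, 1], [1, 0]], [[0, 2], [4, 0]])
print(pi, tau_i, tau, e, gamma)
```

Under these definitions the limiting destination probabilities sum to one over all pairs of states, which provides a quick consistency check on a constructed model.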

2.4 Development of Prediction Methodology 

Under the premise that system reliability is a decreasing function of system activity, 
empirical relationships between system resource usage variables and model parameters 
are sought in order to generalize the effect of resource usage on system error rate. 

One of the promising model parameters used in this study is the limiting destination
probability $\gamma_{ij}$. For each error state, the limiting destination probabilities from normal
resource-usage states to the error state are plotted as a function of the CPU utilization
of the normal resource-usage states, using models from the first three intervals. As an
example, Figure 2.3(a) shows the limiting destination probability $\gamma_{ij}$ to software errors
as a function of CPU utilization of all the normal resource-usage states from the first
three intervals. Using nonlinear regression, an exponential function is fitted to these data.
It is apparent that the limiting destination probability to the software error state is an
increasing function of CPU utilization. The predicted values and the 95% confidence
limits are shown in the figure. To determine whether models based on the first three
intervals could accurately predict the error behavior in the fourth interval, the data
points for the fourth interval are also shown in the same plot. The majority of the data
from the fourth interval fall within the 95% confidence limits. The same holds for the
destination probabilities to the other error types as functions of CPU utilization.
This indicates that models based on the measurements from the first three intervals can
be used to predict the behavior in the fourth interval with reasonable accuracy; i.e., the
behavior of a new environment can be predicted from system operational
information gathered in old environments.

[Figure 2.3: regression plots, panels (a) to (d), for i = normal workload state and j = software error state. Legend: o old environment (first three intervals); • new environment (fourth interval); regression line; 95% confidence limits.]
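The exponential fit can be illustrated with the sketch below. Note that it substitutes an ordinary least-squares fit on the logarithm of the response for the nonlinear regression used in this study, so the fitted coefficients may differ somewhat, and it does not compute confidence limits. The function names and the data points are hypothetical, and all response values are assumed to be strictly positive.

```python
import math


def fit_exponential(x, y):
    """Fit y = a * exp(b * x) by least squares on log(y); requires every y > 0."""
    logs = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_log = sum(logs) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (li - mean_log) for xi, li in zip(x, logs))
    b = sxy / sxx                          # slope of log(y) against x
    a = math.exp(mean_log - b * mean_x)    # intercept mapped back from log space
    return a, b


def predict(a, b, x_new):
    """Evaluate the fitted curve at a new resource-usage value."""
    return a * math.exp(b * x_new)


# Hypothetical (CPU utilization, limiting destination probability) pairs.
cpu = [0.23, 0.25, 0.67, 0.96]
gamma = [0.010, 0.012, 0.030, 0.055]
a, b = fit_exponential(cpu, gamma)
print(a, b, predict(a, b, 0.571))
```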

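A closer look at $\gamma_{ij}$ indicates the necessity of solving the equation

$$f(x, y)\,x\,y = b \qquad (2.8)$$

where, by Equation (2.2), $x$ and $y$ stand for $p_{ij}$ and $\tau_{ij}$, $b$ is the value of $\gamma_{ij}$ obtained
from the regression, and $f(x, y)$ stands for $e_i$, which itself depends on the transition
probabilities and holding times. Clearly, a single equation is not enough to determine
both $x$ and $y$.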

The answer is to make assumptions about two of the terms. Attempts are made to
establish regression relations between $p_{ij}$ and CPU utilization and between $\tau_{ij}$ and CPU
utilization. Neither demonstrates strong uniform relations. Thus, a new model variable
$\gamma'_{ij}$ is defined as

$$\gamma'_{ij} = \hat{e}_i\,p_{ij}\,\hat{\tau}_i \qquad (2.9)$$

where $\hat{e}_i$ and $\hat{\tau}_i$ are the projected limiting entrance probability and the mean waiting time
of the new resource-usage/error/recovery model, respectively. The projection is done by
plotting $e_i$ as a function of $e'_i$ (Figure 2.3(d)) and $\tau_i$ as a function of $\tau'_i$ (Figure 2.3(c)),
where $e_i$ and $\tau_i$ are the limiting entrance rate and mean waiting time, respectively, of
the normal resource-usage states in the resource-usage/error/recovery models; $e'_i$ and $\tau'_i$
are the limiting entrance rate and the mean waiting time of the normal resource-usage
states in the resource-usage models. The new variable, $\gamma'_{ij}$, plotted as a function of
CPU utilization exhibits characteristics similar to those of $\gamma_{ij}$, as shown in Figure 2.3(b). By
introducing the additional regression models, the prediction of transitions to error states
in the new resource-usage/error/recovery model becomes possible; the transitions are
in turn defined by the transition probability $p_{ij}$ and holding time $\tau_{ij}$, where $i$ = normal
resource-usage states and $j$ = error states.

Before the prediction procedures are introduced, the regression construction process 
is outlined as follows. 

1. Collect resource-usage and fault data for different intervals of time.

2. Identify resource-usage indices, and construct the resource-usage model and the
resource-usage/error/recovery model for each interval.

3. Calculate the assortment of model parameters for each resource-usage model and
resource-usage/error/recovery model.

4. Build the $\gamma_{ij}$ regression models by plotting the $\gamma_{ij}$'s as functions of resource usage for
each error state ($j$ = error state).

5. Explore whether the $p_{ij}$'s to each error state are increasing functions of the
resource-usage index.

6. Build the $e_i$ regression model by plotting $e_i$ vs. $e'_i$. The predicted line represents
$\hat{e}_i$, which is the projected limiting entrance probability of a normal resource-usage
state in the predicted resource-usage/error/recovery model.

7. If the $p_{ij}$'s are not increasing functions of the resource-usage index, build the $\tau_i$
regression model by plotting $\tau_i$ vs. $\tau'_i$. The regression line represents $\hat{\tau}_i$, which
is the projected mean waiting time of a normal resource-usage state in the predicted
resource-usage/error/recovery model. Then, calculate the $\gamma'_{ij}$'s and establish the $\gamma'_{ij}$
regression models.

2.5 Prediction of the Resource-Usage/Error/Recovery Model

Using the regression models developed above, and given the resource-usage
information of a new environment, the new resource-usage/error/recovery model can be
derived. The procedure to derive the new model is outlined below.

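1. Construct the resource-usage model for the new environment.

2. Derive the $p_{ij}$'s using the $\gamma'_{ij}$ or $p_{ij}$ model.

3. Derive the $\tau_{ij}$'s using the $\gamma_{ij}$ model.

4. Normalize the $p_{ij}$'s such that $\sum_j p_{ij} = 1$.

5. Set $p_{ij} = \pi_j$ where $i$ = recovery state and $j$ = normal workload states.

6. Let $\tau_{ij} = \bar{\tau}_i$ where $i$ = recovery state and $j$ = normal workload states, with $\bar{\tau}_i$
the average mean waiting time of the recovery state over the measured intervals.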

First, a state transition model (similar to Figure 2.1) is formed, based on the
resource-usage information. To calculate the predicted values of $p_{ij}$ from the $\gamma'_{ij}$ regression, the
values $e'_i$ and $\tau'_i$ of the resource-usage model have to be derived. The projected $\hat{e}_i$ and $\hat{\tau}_i$
are then extrapolated from the $e_i$ and $\tau_i$ regression models. For example, a state in the
new model with a CPU utilization of 0.571 is used; the associated $e'_i$ and $\tau'_i$ are 0.0000342
and 29223.1. The $\gamma'_{ij}$'s from the resource-usage states to each error state are extrapolated
from the $\gamma'_{ij}$ regression models (Figure 2.3(b)) with projected values of $\hat{e}_i$ and $\hat{\tau}_i$. From
Figure 2.3(b), the $\gamma'_{ij}$ to software error is 0.0358. The values of $p_{ij}$ to each error state are
calculated from Equation (2.9) using the values for $\gamma'_{ij}$, $\hat{e}_i$, and $\hat{\tau}_i$. For the state in the
example, the $p_{ij}$ to software error becomes 0.393.

Next, to derive the $\tau_{ij}$'s of the new resource-usage/error/recovery model, the $\gamma_{ij}$'s
from the resource-usage states to each error state are extrapolated from the $\gamma_{ij}$ regression
model (Figure 2.3(a)), and found to be 0.0324 for the state in the example. The values
of $p_{ij}$ and $\hat{e}_i$, which have just been derived from the previous two regression models, are
then applied to Equation (2.2) to determine $\tau_{ij}$. The value of $\tau_{ij}$ to software error for the
state in the example is 1619.7 seconds.

Note that the predicted $p_{ij}$'s from resource-usage states to error states must be
normalized so that they sum to one. After normalization, $p_{ij}$ for the state in the above
example is 0.2704.
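The arithmetic of this procedure can be sketched as below for a single resource-usage state. Equation (2.9) is solved for the transition probability, Equation (2.2) is solved for the holding time, and the probabilities are then normalized. The function name and all input values are hypothetical, and the use of the unnormalized probability in the holding-time step follows the order in which the steps are listed, which is an assumption of this sketch.

```python
def predict_error_transitions(gamma_prime, gamma, e_hat, tau_hat):
    """Predict transitions from one resource-usage state to each error state.

    gamma_prime, gamma: dicts mapping error state -> regressed value.
    e_hat, tau_hat:     projected entrance probability and mean waiting time.
    Returns (p_normalized, tau) as dicts keyed by error state.
    """
    # Equation (2.9) solved for p_ij.
    p = {j: gamma_prime[j] / (e_hat * tau_hat) for j in gamma_prime}
    # Equation (2.2) solved for tau_ij, using the p_ij just obtained.
    tau = {j: gamma[j] / (e_hat * p[j]) for j in gamma if p[j] > 0}
    # Normalize so that the outgoing probabilities sum to one.
    total = sum(p.values())
    p_normalized = {j: v / total for j, v in p.items()}
    return p_normalized, tau


# Hypothetical regressed values for two error states.
p, tau = predict_error_transitions(
    gamma_prime={"software": 0.02, "disk": 0.06},
    gamma={"software": 0.03, "disk": 0.05},
    e_hat=0.0001,
    tau_hat=1000.0,
)
print(p)
print(tau)
```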
For the transitions from the recovery and failure states to normal resource-usage
states, the state probability $\pi_i$ of the resource-usage model best estimates the transition
probabilities, and the mean holding time is estimated by the average of the mean waiting
times of each recovery state in the first three intervals.

The prediction of the transition probabilities and mean holding times from normal
resource-usage states to error states and from recovery states to normal resource-usage states
completes the new resource-usage/error/recovery model.

2.6 Results of the Prediction Methodology 

Using the data from the first three intervals, the resource-usage/error/recovery model 
for the fourth interval is obtained. We compare the predicted transition probabilities and 
holding times of the transitions from the normal resource-usage states to the error states, 
with the ones measured for the fourth interval. 

For simplicity, the resource-usage model for the fourth interval is modeled as a one- 
state model. Using the prediction method developed in the previous sections, the fail- 
ure behavior in terms of state transition probabilities and mean holding times from 
the normal resource-usage state to error states is determined. The predicted
resource-usage/error/failure model is shown in Figure 2.4. Figure 2.5 contains the
resource-usage/error/failure model constructed from the actual failure data from the fourth interval.

Tables 2.4 and 2.5 list the predicted and actual transition probabilities and mean
holding times from the normal resource-usage state to error states. To quantify the
goodness of the prediction method, consider the confidence intervals for the expected
$p_{ij}$'s and $\tau_{ij}$'s. Since the expected $p_{ij}$'s and $\tau_{ij}$'s are estimates, there are confidence
intervals associated with them. When the absolute difference between the predicted and
expected values is less than the x% confidence interval, the predicted value is said to fall
within x%. For example, for the transition probability to disk error, the absolute difference,
0.05938, is slightly greater than 0.05933 but less than 0.09091. Thus, the predicted value
is close to the 90% interval. For the most part, the predicted values fall within the 90%
interval. The large difference for channel error is due to the fact that the actual number
of transitions to channel error (only one was observed) is statistically insignificant.

Figure 2.4: Predicted Resource-Usage/Error/Recovery Model for the Fourth Interval

Figure 2.5: Actual Resource-Usage/Error/Recovery Model for the Fourth Interval

Table 2.4: Transition Probabilities

dest.      no. obs   expected   predicted   diff.      90%       99%
Channel    1                    0.02578     0.02035    0.00892   0.01397
Mult       27        0.14674                0.01325    0.02609   0.04086
Disk       111       0.60326    0.54388     -0.05938   0.05933   0.09091
Software   45        0.24457    0.27036     0.02579    0.05213   0.08163

Table 2.5: Mean Holding Time

dest.      no. obs   expected   predicted   diff.      90%       99%
Channel    1         2210.68    1164.77     -1045.91   .         .
Mult       27        1550.90    1430.19     -120.71    517.75
Disk       111       1172.86    1421.95                176.04    275.67
Software   45        1708.40    1691.69     -16.71     547.70    857.67

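The interval check described above can be sketched as follows, using the disk and software rows of Table 2.4. The function name `interval_level` is hypothetical; the sketch simply compares the absolute difference between the predicted and expected values with the listed interval values.

```python
def interval_level(expected, predicted, intervals):
    """Return the smallest listed confidence level that contains the prediction.

    intervals: dict mapping a level (e.g. 90) to the interval value listed
    in the table. Returns None if the prediction is outside every level.
    """
    diff = abs(predicted - expected)
    for level in sorted(intervals):
        if diff < intervals[level]:
            return level
    return None


# Disk row of Table 2.4: just outside the 90% interval, inside the 99% interval.
print(interval_level(0.60326, 0.54388, {90: 0.05933, 99: 0.09091}))  # 99
# Software row of Table 2.4: inside the 90% interval.
print(interval_level(0.24457, 0.27036, {90: 0.05213}))  # 90
```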
3. A SECOND EXAMPLE


In this chapter the prediction methodology is further illustrated by applying it to a 
multicomputer system configuration. The data set used to develop the method in the 
previous chapter is referred to as Data Set I, and the new data set is referred to as Data 
Set II. 

3.1 Data 

Data Set II contains a system resource-usage and failure log collected on two IBM
370/168 mainframes over a period of three years [10]. Again the resource usage is
measured by the SMF (System Management Facility). The recorded indices include CPU
load, batch memory requests and usage, batch I/O wait time and load, and batch paging
in and out. For every hour, the average values for each index are computed to depict
the system resource usage for that hour. The clustering algorithm is later applied to
the hourly averages. Again, the dependence of system reliability on CPU usage is
investigated. The recorded CPU-related indices are TOTCPU, a measure of the total CPU
usage as a fraction between 0 and 1, and BATCPU, a measure of the batch CPU usage as
a fraction between 0 and 1. An additional index, SYSCPU, a measure of the interactive
and system workload, is calculated as the difference between TOTCPU and BATCPU.

The failure data, recorded by the built-in failure detection facilities, contain error
occurrences and recovery attempts. Errors occurring within five minutes of each other are
coalesced into one error to eliminate the many manifestations of a single fault. As a
result, a new type of error is created to account for different types of errors occurring close
in time. Thus, four types of errors are defined: CPU-related errors in the central processor
and storage (MCH), channel-related errors in I/O channels and associated interfaces
(CCH), other errors in addition to MCH and CCH (OTH), and different types of error
occurring in close succession (MUL).

The choice of six intervals represents a compromise
between two conflicting requirements. For regression, it is desirable to have a significant
number of intervals (greater than seven). From the clustering point of view, the number
of observations in each cluster has to be substantial for meaningful clustering results.
The decision to use six intervals is based on trial and error. Table 3.1 shows Data Set II
divided into six intervals.

3.2 Modeling 

First, the resource-usage data from the first five intervals are clustered by BATCPU
and SYSCPU. Data from each interval are clustered into five clusters. Each of the
resource-usage models is therefore a five-state model, with each state varying in mean
BATCPU and mean SYSCPU.

Table 3.1: IBM370 Data Intervals

interval   span            no. of obs.   no. of err.
1          2/79 - 6/79     3600          498
2          7/79 - 12/79    4416          311
3          1/80 - 6/80     4368          313
4          7/80 - 12/80    4416          721
5          1/81 - 6/81     4366          730
6          7/81 - 12/81    4416          338

Next, the corresponding resource-usage/error/recovery models are constructed. To
simplify the model, recovery states are combined with the error states. In other words,
instead of making transitions to recovery states and then to resource-usage states, the
error states make transitions directly to the resource-usage states. Each of the
resource-usage/error/recovery models, therefore, has a total of nine states: four
error/recovery states and five normal resource-usage states.
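
The merging of recovery states into error states can be sketched as follows. This is an
illustration under stated assumptions, not the procedure used on the measured data: the
state labels U1 to U5, the marker "REC" for a recovery state, and the function name
transition_probabilities are all hypothetical. Transition probabilities are estimated as
simple relative frequencies of observed transitions.

```python
from collections import Counter, defaultdict

USAGE = ["U1", "U2", "U3", "U4", "U5"]    # five normal resource-usage states
ERRORS = ["MCH", "CCH", "OTH", "MUL"]     # four combined error/recovery states


def transition_probabilities(sequence):
    """Estimate transition probabilities from an observed state sequence.

    Entries labeled "REC" (hypothetical recovery marker) are dropped, so each
    error state makes its transition directly to a resource-usage state.
    """
    merged = [s for s in sequence if s != "REC"]
    counts = defaultdict(Counter)
    for src, dst in zip(merged, merged[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }


# Illustrative sequence only (not thesis data).
seq = ["U1", "U2", "MCH", "REC", "U1", "U2", "U3", "CCH", "REC", "U2", "U1"]
p = transition_probabilities(seq)
```

Here p["U2"]["MCH"] is the estimated probability of moving from usage state U2 to the MCH
error state, and p["MCH"] contains only resource-usage states as destinations.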

After all of the resource-usage and failure models are constructed, the necessary model 
parameters are calculated for regression purposes. Again, the parameters of interest are 
those with i = each resource-usage state and j = each error state.

3.3 Regression 

As a first step in finding regression relations, the γ_ij's from the first five intervals
are plotted versus CPU usage for j = {MCH, CCH, OTH, MUL} (not shown). It is observed
that, for j = {MCH, CCH}, γ_ij vs. TOTCPU is the most pronounced increasing function,
and for j = {OTH, MUL}, γ_ij vs. SYSCPU is the most pronounced increasing function.
Regression relations are then established. Examples are shown in Figure 3.1.

To solve the f(x,y)xy = b dilemma, p_ij vs. CPU usage relations are plotted and
found to be more pronounced increasing functions of CPU usage than the γ_ij vs. CPU usage
relations. Thus, p_ij vs. CPU usage regression relations are established for the four error
states. In addition, the - regression relation is determined. 
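
A regression relation of this kind can be sketched as below. The functional form of the
regression is not stated in this section, so a straight-line ordinary least-squares fit
is assumed here; the function name fit_line and all numeric values are illustrative and
are not thesis data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical mean CPU usage of five states and the matching p_ij estimates.
cpu = [0.1, 0.2, 0.3, 0.4, 0.5]
p_obs = [0.012, 0.014, 0.016, 0.018, 0.020]
a, b = fit_line(cpu, p_obs)

# Apply the fitted relation to a state of a new interval with mean CPU usage 0.6.
predicted = a + b * 0.6
```

A positive slope b corresponds to the increasing dependence on CPU usage described above.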

3.4 Prediction 

To predict the resource-usage/error/recovery model for the last interval, the
resource-usage model is constructed as a five-state model. Using the regression relations
determined in the previous section, p_ij and τ_ij from the resource-usage states to the
error states can be predicted. The transition probabilities p_ij are determined directly
from the p_ij regression models, and the mean holding times τ_ij are determined by the
e_ij and γ_ij regression relations along with the calculated p_ij's. The resulting values
for the p_ij's are shown in Table 3.2, and for the τ_ij's in Table 3.3.

3.5 Discussion 

Observe Tables 3.2 and 3.3. For both the p_ij's and τ_ij's, most of the predicted values
fall within the 90% confidence interval. All of the predicted values fall within the 99%
confidence interval with very few exceptions. All of the exceptions are transitions of very


[Figure 3.1, panel (b): γ_ij vs. SYSCPU for j = MUL]

Figure 3.1: Limiting Destination Probability from Normal Resource-Usage States to Error
States

Table 3.2: Expected vs. Predicted p_ij for i = Resource-Usage States and j = Error States
for the Sixth Interval

j     i   obs.   exp. p_ij   pred. p_ij   diff.         90%           99%
MCH   1   13     0.0253      0.0136       -0.0118       0.0114        0.0179
      2   3      0.0110      0.0133       0.0023        0.0104        0.0163
      3   11     0.0266      0.0156       [illegible]   0.0130        0.0204
      4   1      0.0038      0.0189       0.0151        0.0062        0.0097
      5   5      0.0144      0.0072       [illegible]   0.0105        0.0165
CCH   1   35     0.0682      0.0724       [illegible]   [illegible]   0.0287
      2   29     0.1062      0.0633       [illegible]   0.0307        0.0480
      3   46     0.1114      0.0714       [illegible]   0.0255        0.0399
      4   24     0.0906      0.0797       [illegible]   0.0290        0.0454
      5   22     0.0634      0.0606       [illegible]   0.0215        0.0337
OTH   1   14     0.0273      0.0427       0.0154        0.0118        0.0185
      2   50     0.1832      0.1897       0.0065        0.0385        0.0603
      3   47     0.1138      0.1050       [illegible]   0.0257        0.0403
      4   3      0.0113      0.0347       0.0234        0.0107        0.0167
      5   23     0.0663      0.0439       [illegible]   0.0220        0.0344
MUL   1   4      0.0078      0.0041       [illegible]   0.0064        0.0100
      2   3      0.0110      0.0134       0.0024        0.0104        0.0163
      3   4      0.0097      0.0084       [illegible]   0.0079        0.0124
      4   0      0.0000      0.0035       .             .             .
      5   1      0.0029      0.0042       0.0014        0.0047        0.0074

Table 3.3: Expected vs. Predicted τ_ij for i = Resource-Usage States and j = Error States
for the Sixth Interval

j     i   obs.   exp. τ_ij   pred. τ_ij    diff.         90%       99%
MCH   1   13     2736.4      3319.1        582.8         1775.5    2780.4
      2   3      2587.3      6456.2        3868.9        2816.0    4409.7
      3   11     4752.4      4471.9        -280.5        2628.2    4115.7
      4   1      373.0       5372.2        4999.2        .         .
      5   5      3862.6      5130.3        1267.7        5093.4    7976.0
CCH   1   35     4289.1      4360.0        [illegible]   1080.3    1691.7
      2   29     6629.9      9226.6        2596.7        2529.6    3961.2
      3   46     4938.6      [illegible]   1623.5        2001.9    3134.9
      4   24     14985.6     8353.0        -6632.6       7968.2    12477.9
      5   22     8561.1      4846.4        -3714.7       3486.5    5459.7
OTH   1   14     2662.2      4344.2        1682.0        1526.9    2391.0
      2   50     7107.5      5807.6        -1300.0       2372.6    3715.4
      3   47     4251.8      4776.6        524.9         1256.4    1967.5
      4   3      1175.0      7711.6        6536.6        1637.2    2563.8
      5   23     5349.8      6152.4        802.6         3396.1    5318.1
MUL   1   4      1328.2      4271.6        2943.4        1216.6    1905.2
      2   3      3474.7      5534.8        2060.1        1964.5    3076.3
      3   4      11325.5     4611.8        [illegible]   9661.4    15129.3
      4   0      0.0         7611.8        7611.8        .         .
      5   1      980.0       6048.9        5068.9        .         .

few observations, which again makes the results statistically insignificant. For example,
for both the p_ij's and τ_ij's, the exceptions are (i = 4, j = MCH), one observation;
(i = 4, j = OTH), three observations; and (i = 4, j = MUL), zero observations.
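
The comparison against the confidence intervals can be made concrete with a small check.
The sketch below assumes that the 90% and 99% columns of Table 3.2 are half-widths of
confidence intervals centered on the expected value, so that a prediction falls within an
interval when the absolute difference does not exceed the half-width. That reading of the
columns is an assumption, and the function name classify is hypothetical. Only rows whose
values are legible in Table 3.2 are used.

```python
def classify(diff, half90, half99):
    """Return the tightest confidence band that contains the prediction."""
    if abs(diff) <= half90:
        return "within 90%"
    if abs(diff) <= half99:
        return "within 99%"
    return "exception"


# (j, i): (diff., 90%, 99%) copied from Table 3.2 for the p_ij's.
rows = {
    ("MCH", 1): (-0.0118, 0.0114, 0.0179),
    ("MCH", 2): (0.0023, 0.0104, 0.0163),
    ("MCH", 4): (0.0151, 0.0062, 0.0097),
    ("OTH", 4): (0.0234, 0.0107, 0.0167),
}
verdicts = {key: classify(*values) for key, values in rows.items()}
```

Under this assumption, the two i = 4 rows are classified as exceptions, which agrees with
the exceptions listed above.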

The inaccuracies in prediction are explained by the long time span of an interval and 
by the low error rate. In general, the longer the time that a data set spans, the more 
variation is contained in the data set. Simply stated, exceptional phenomena are much 
more likely to occur over a period of three years than over three months. Low error 
rate makes the regression less statistically significant. Data Set I is a better candidate 
for the method, because Data Set I has an average of 323 errors per month while Data 
Set II has 83 errors per month. The inaccuracies occur when the transitions are few, 
as discussed in the previous two paragraphs. When errors are rare or infrequent, any 
statistical inferences made based on the few observations can be unsound. Predicting the
rare and infrequent is almost impossible and probably unrewarding. 
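
The figure of 83 errors per month for Data Set II can be reproduced from Table 3.1. The
short calculation below assumes that the six intervals cover February 1979 through
December 1981, that is, one five-month interval followed by five six-month intervals.

```python
# Error counts per interval, copied from Table 3.1.
errors = [498, 311, 313, 721, 730, 338]
# Calendar months per interval (2/79 - 6/79 is five months, the rest six each).
months = [5, 6, 6, 6, 6, 6]

errors_per_month = sum(errors) / sum(months)
```

The total of 2911 errors over 35 months gives roughly 83 errors per month.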
4. CONCLUSIONS


In the past, analytical and measurement-based models have been developed to
characterize system behavior. An open issue has been how these models can be used, if at
all, for system design improvement or system tuning. This thesis has attempted to address
this issue. Past studies have shown the resource-usage dependency of system failure
behavior. The current study shows that it is possible to predict the system error/failure
rate when given resource-usage information. Previous fault prediction schemes have lacked
the use of real measurements; with the proposed method, measurement-based modeling
finds its place in fault prediction applications.

This thesis has proposed a combined statistical/analytical approach that uses
measurements from existing environments to forecast the system failure behavior in a new
environment. Using regression, the method makes generalizations about the system failure
behavior from models extracted from the measurements of the existing environments.
When the resource usage of a new environment is known, predictions can be made from
the generalized system failure behavior. Comparisons of the predicted results with the
actual data from the new environment show a close correspondence for both data sets to
which the method has been applied.

However, the method is somewhat data-sensitive, although mostly system-independent.
Improved accuracy is possible when the errors are frequent. In the case of very low error
rates, the method attempts the almost impossible, namely to predict accidental and
infrequent events, and hence may not succeed.

During the course of the research, many fascinating ideas have arisen that deserve
future investigation. Some of these are:

1. Other standardizing variables in addition to the steady-state destination probability 
can be explored. 

2. The effect of other measured resource-usage variables can be studied. 


3. Regression on multiple resource-usage indices can be explored. 